Red Wine Exploration by Darui Zhang
What properties make a great red wine great? In this project we try to answer that question by exploring the red wine data set.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Univariate Plots Section
Feature Names and Summary
This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Quality Distribution
Wine quality is a discrete grade. In this data set it ranges from 3 to 8, with a median of 6.
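As a minimal sketch of how such a distribution can be summarized, the base R snippet below tabulates a discrete grade and reports its range and median. The `quality` vector holds a handful of made-up grades for illustration, not the actual data set.

```r
# Toy stand-in for the quality column (hypothetical values, not the real data).
quality <- c(5, 6, 5, 7, 6, 3, 8, 6, 5, 4, 6, 7)

# Count how many wines received each grade.
quality_counts <- table(quality)
print(quality_counts)

# Range and median of the grades.
quality_range <- range(quality)
quality_median <- median(quality)
print(quality_range)
print(quality_median)
```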

Distribution of Other Chemical Properties

Univariate Analysis
Some observations on the distributions of the chemical properties:
Normal: Volatile acidity, Density, pH
Positively Skewed: Fixed acidity, Citric acid, Free sulfur dioxide, Total sulfur dioxide, Sulphates, Alcohol
Long Tail: Residual sugar, Chlorides
Rescale Variable
Skewed and long-tailed data can be transformed toward a more normal distribution by taking the square root or the log. Taking sulphates as an example, we compare the original feature with its square root and its log.

Both the square root and the log help transform the feature toward a normal distribution. Of the two, the log-scaled feature is closer to normal.
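One way to put a number on this comparison is the sample skewness, which is close to 0 for a symmetric distribution. The sketch below uses a simulated right-skewed variable as a hypothetical stand-in for sulphates; the `skewness` helper is defined here for illustration and does not come from the original analysis.

```r
set.seed(1)

# Hypothetical right-skewed variable standing in for sulphates
# (simulated log-normal values, not the real data).
x <- rlnorm(1599, meanlog = log(0.62), sdlog = 0.25)

# Sample skewness: the third standardized moment.
skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

# Compare the original feature with its square root and its log.
skews <- c(
  original = skewness(x),
  sqrt     = skewness(sqrt(x)),
  log10    = skewness(log10(x))
)
print(round(skews, 3))
```

A value nearer 0 indicates a more symmetric shape. On real data the ranking of the three transforms may differ from this simulated case.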
Bivariate Plots Section
Bivariate Plots Selection
A plot matrix was used to get a first look at the data. We are interested in the correlation between wine quality and each chemical property.

The four features most strongly correlated with wine quality (absolute correlation greater than 0.2) are:
| alcohol | 0.476 |
| volatile.acidity | -0.391 |
| sulphates | 0.251 |
| citric.acid | 0.226 |
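A ranking of this kind can be sketched in base R by correlating every feature with quality and sorting by absolute value, so that strong negative correlations are kept. The `toy` data frame below is simulated for illustration; its column names mirror the data set, but its values and coefficients are assumptions.

```r
set.seed(42)
n <- 1000

# Simulated stand-in for the wine data (hypothetical values).
toy <- data.frame(
  alcohol          = rnorm(n, 10.4, 1),
  volatile.acidity = rnorm(n, 0.53, 0.18),
  pH               = rnorm(n, 3.3, 0.15)
)
toy$quality <- with(toy, round(
  5.6 + 0.4 * (alcohol - 10.4) - 2 * (volatile.acidity - 0.53) +
    rnorm(n, 0, 0.7)
))

# Pearson correlation of each feature with quality.
features <- setdiff(names(toy), "quality")
cors <- sapply(features, function(f) cor(toy[[f]], toy$quality))

# Rank by absolute value, then keep correlations stronger than 0.2.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
strong <- ranked[abs(ranked) > 0.2]
print(round(strong, 3))
```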
Bivariate Analysis
Alcohol content has the strongest correlation with wine quality. The scatter plot of alcohol and wine quality is shown below.

The original plot is overplotted, so we add an alpha value and lines for the 0.1, 0.5 and 0.9 quantiles to show the general trend.

In this plot, the trend of wine quality increasing with alcohol content can be clearly observed.
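The quantile lines can be approximated numerically. The sketch below bins alcohol to the nearest whole percent and computes the 0.1, 0.5 and 0.9 quantiles of quality within each bin. This binning is an assumption, since the original plotting code is not shown, and the data are simulated.

```r
set.seed(7)
n <- 1000

# Simulated alcohol and quality values (hypothetical, not the real data).
alcohol <- runif(n, 8.5, 14)
quality <- pmin(8, pmax(3, round(2 + 0.35 * alcohol + rnorm(n, 0, 0.7))))

# Bin alcohol to the nearest whole percent, then take the
# 0.1, 0.5 and 0.9 quantiles of quality inside each bin.
alcohol_bin <- round(alcohol)
q_by_bin <- sapply(split(quality, alcohol_bin),
                   quantile, probs = c(0.1, 0.5, 0.9))
print(q_by_bin)
```

Each column of `q_by_bin` is one alcohol bin and each row is one quantile, so reading along a row traces one of the trend lines.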
Distribution Analysis
In this analysis, we check whether the distribution of each chemical property differs between wine quality grades.

Note that since the number of wines in each quality grade is not equal, the distributions of the highest and lowest grades are hard to see.
A normalized plot is shown below.

The plot looks a little busy. We group the grades in pairs: grades 3 and 4 as “Low”, grades 5 and 6 as “Medium”, grades 7 and 8 as “High”, and plot again.

The new plot looks cleaner.
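A minimal sketch of the grouping and normalization, using base R `cut` and `prop.table`. The `wines` data frame holds a few made-up rows, and the alcohol bin edges are arbitrary choices for illustration.

```r
# A few made-up wines (hypothetical values, not the real data).
wines <- data.frame(
  quality = c(3, 4, 5, 5, 6, 6, 6, 7, 8, 5),
  alcohol = c(9.0, 9.4, 9.5, 9.8, 10.2, 10.5, 11.0, 11.8, 12.5, 9.2)
)

# Group grades in pairs: (2,4] -> Low, (4,6] -> Medium, (6,8] -> High.
wines$grade <- cut(wines$quality, breaks = c(2, 4, 6, 8),
                   labels = c("Low", "Medium", "High"))

# Bin alcohol, then count wines per grade and alcohol bin.
wines$alcohol_bin <- cut(wines$alcohol, breaks = c(8, 10, 12, 14))
counts <- table(wines$grade, wines$alcohol_bin)

# Normalize within each grade (each row sums to 1), so that
# grades with few wines remain comparable to the others.
within_grade <- prop.table(counts, margin = 1)
print(round(within_grade, 2))
```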
A similar analysis was done for the three other factors: volatile acidity, sulphates and citric acid.


As noted in the univariate analysis, the sulphates data is skewed, so we tried both the original and the log scale of the feature.


The log scaled feature looks better.
Correlation Between Features
There is an interesting correlation between two of the main features: volatile acidity and citric acid.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

##
## Pearson's product-moment correlation
##
## data: redwine$volatile.acidity and redwine$citric.acid
## t = -26.4891, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
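Output of this shape comes from base R's `cor.test`, which runs a Pearson test by default. The sketch below applies it to two simulated, negatively related variables; the values and the linear relationship are assumptions for illustration and will not reproduce the numbers above.

```r
set.seed(3)
n <- 1599

# Simulated, negatively related variables (hypothetical values).
citric.acid <- runif(n, 0, 1)
volatile.acidity <- 0.7 - 0.5 * citric.acid + rnorm(n, 0, 0.15)

# Pearson's product-moment correlation test (the default method).
result <- cor.test(volatile.acidity, citric.acid)

print(result$estimate)   # sample correlation
print(result$parameter)  # degrees of freedom, n - 2
print(result$conf.int)   # 95 percent confidence interval
```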
Multivariate Plots Section
Main Chemical Property vs Wine Quality
Using color, we can add another dimension to a plot. Of the four main features, alcohol and volatile acidity are the two most strongly correlated with wine quality.

The figure looks overplotted because wine quality takes only discrete values. We use a jitter plot to alleviate this problem.

We can see that higher quality wines have higher alcohol and lower volatile acidity.
Add Another Feature
Now we add a third feature, the log scale of sulphates, and use facets to show the wine grade.

We can see that higher quality wines have higher alcohol (x-axis), lower volatile acidity (y-axis) and higher sulphates.
Main Chemical Properties vs Wine Quality
Since we can visualize only three dimensions at a time, including wine quality, two graphs are needed to cover the four main chemical properties.

The same effect of alcohol and volatile acidity on wine quality can be observed.

We can see that higher quality wines have higher sulphates (x-axis) and higher citric acid (y-axis).
Linear Multivariable Model
A multivariable linear model was created to predict wine quality from the chemical properties.
Features are added incrementally, roughly in order of how strongly each one correlates with wine quality (the calls below start with volatile acidity, then add alcohol).
##
## Calls:
## m1: lm(formula = quality ~ volatile.acidity, data = redwine)
## m2: lm(formula = quality ~ volatile.acidity + alcohol, data = redwine)
## m3: lm(formula = quality ~ volatile.acidity + alcohol + sulphates,
## data = redwine)
## m4: lm(formula = quality ~ volatile.acidity + alcohol + sulphates +
## citric.acid, data = redwine)
## m5: lm(formula = quality ~ volatile.acidity + alcohol + sulphates +
## citric.acid + chlorides, data = redwine)
## m6: lm(formula = quality ~ volatile.acidity + alcohol + sulphates +
## citric.acid + chlorides + total.sulfur.dioxide, data = redwine)
## m7: lm(formula = quality ~ volatile.acidity + alcohol + sulphates +
## citric.acid + chlorides + total.sulfur.dioxide + density,
## data = redwine)
##
## ========================================================================================================
##                       m1          m2          m3          m4          m5          m6          m7
## --------------------------------------------------------------------------------------------------------
## (Intercept)           6.566***    3.095***    2.611***    2.646***    2.769***    2.985***    -0.953
##                       (0.058)     (0.184)     (0.196)     (0.201)     (0.202)     (0.206)     (11.990)
## volatile.acidity      -1.761***   -1.384***   -1.221***   -1.265***   -1.155***   -1.104***   -1.114***
##                       (0.104)     (0.095)     (0.097)     (0.113)     (0.115)     (0.115)     (0.120)
## alcohol                           0.314***    0.309***    0.309***    0.292***    0.276***    0.280***
##                                   (0.016)     (0.016)     (0.016)     (0.016)     (0.017)     (0.020)
## sulphates                                     0.679***    0.696***    0.871***    0.908***    0.903***
##                                               (0.101)     (0.103)     (0.111)     (0.111)     (0.112)
## citric.acid                                               -0.079      0.021       0.065       0.044
##                                                           (0.104)     (0.106)     (0.106)     (0.124)
## chlorides                                                             -1.663***   -1.763***   -1.747***
##                                                                       (0.405)     (0.403)     (0.406)
## total.sulfur.dioxide                                                              -0.002***   -0.002***
##                                                                                   (0.001)     (0.001)
## density                                                                                       3.923
##                                                                                               (11.944)
## --------------------------------------------------------------------------------------------------------
## R-squared             0.153       0.317       0.336       0.336       0.343       0.352       0.352
## adj. R-squared        0.152       0.316       0.335       0.334       0.341       0.349       0.349
## sigma                 0.744       0.668       0.659       0.659       0.656       0.651       0.652
## F                     287.444     370.379     268.912     201.777     166.407     143.910     123.298
## p                     0.000       0.000       0.000       0.000       0.000       0.000       0.000
## Log-likelihood        -1794.312   -1621.814   -1599.384   -1599.093   -1590.662   -1580.192   -1580.138
## Deviance              883.198     711.796     692.105     691.852     684.595     675.689     675.643
## AIC                   3594.624    3251.628    3208.768    3210.186    3195.324    3176.384    3178.276
## BIC                   3610.756    3273.136    3235.654    3242.448    3232.964    3219.401    3226.670
## N                     1599        1599        1599        1599        1599        1599        1599
## ========================================================================================================
The model with 6 features has the lowest AIC (Akaike information criterion), 3176.384. Adding a seventh feature, density, raises the AIC to 3178.276 and leaves R-squared unchanged at 0.352. The intercept also changes dramatically, from 2.985 to -0.953 with a standard error of 11.990, which is a sign of overfitting.
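The incremental comparison can be sketched with base R `lm` and `AIC`. The example below fits three nested models on simulated data and picks the one with the lowest AIC. The `toy` data frame, its coefficients and the `noise.feature` column are assumptions for illustration, not the models fitted above.

```r
set.seed(11)
n <- 800

# Simulated stand-in for the wine data (hypothetical values).
# noise.feature is unrelated to quality by construction.
toy <- data.frame(
  alcohol          = rnorm(n, 10.4, 1),
  volatile.acidity = rnorm(n, 0.53, 0.18),
  noise.feature    = rnorm(n)
)
toy$quality <- with(toy,
  5.6 + 0.3 * (alcohol - 10.4) - 1.2 * (volatile.acidity - 0.53) +
    rnorm(n, 0, 0.65)
)

# Add one feature at a time, as in the calls above.
m1 <- lm(quality ~ volatile.acidity, data = toy)
m2 <- lm(quality ~ volatile.acidity + alcohol, data = toy)
m3 <- lm(quality ~ volatile.acidity + alcohol + noise.feature, data = toy)

# Lower AIC is better; AIC penalizes each extra parameter.
aic <- sapply(list(m1 = m1, m2 = m2, m3 = m3), AIC)
best <- names(which.min(aic))
print(round(aic, 1))
print(best)
```

A feature that adds little explanatory power usually fails to lower the AIC by enough to offset the penalty, which matches the pattern seen when density is added in m7.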
Final Plots and Summary
Plot One

Description One
Plot Two
Description Two
Plot Three

Description Three
Reflection